import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
INTRODUCTION
This study investigates the efficacy of regression algorithms, namely Linear Regression and RandomForest, together with the time-series model ARIMA (AutoRegressive Integrated Moving Average), in predicting trends and volatility in cryptocurrencies, specifically Bitcoin and Ethereum. The motivation behind this exploration stems from the growing interest in cryptocurrency markets and the need for accurate predictive models to assist investors and traders in decision-making.
PROBLEM STATEMENT
Cryptocurrency has become incredibly popular and easy to get into, attracting a wide range of people interested in investing and trading. Its decentralized nature and low entry barriers have made it accessible to almost anyone, promising high returns, albeit with a high-risk, high-reward profile. Despite this widespread popularity, however, accurately predicting cryptocurrency trends remains challenging.
For investors and traders to make informed decisions and manage risk in the cryptocurrency market, accurate forecasting is crucial. This is where predictive models come into play. By using simple regression algorithms like Linear Regression and RandomForest, together with time-series analysis methods like ARIMA, we can better understand the trends of cryptocurrencies.
HIGHLIGHTS
In machine learning, greater model complexity does not guarantee better performance. Simple models like linear regression deserve careful consideration, as they can outperform more intricate ones.
Bitcoin and Ethereum are characterized by significant price fluctuations (volatility), where recent price movements notably impact the next price. The influence of past patterns decays with lag, suggesting that seasonal or cyclical trends do not persist.
While a model may display a low Mean Absolute Error (MAE), focusing solely on this metric may not capture its practical effectiveness. Even with a low MAE, a model can fail to predict the volatile price swings that matter most when investing in crypto.
# IMPORT LIBRARIES
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
from itertools import product
import time
import warnings
warnings.filterwarnings("ignore")
DATA
The dataset was sourced from Kaggle (Bitcoin & Ethereum prices (2014-2024)) and compiled from various open trading market sources.
Description
It includes daily prices of Bitcoin (BTC) and Ethereum (ETH) spanning from 2014-09-18 to 2024-01-21 and 2017-11-10 to 2024-01-21 respectively.
Columns
- Date: The recorded date.
- Open: Opening price of the cryptocurrency.
- High: Highest price during the trading period.
- Low: Lowest price during the trading period.
- Close: Closing price of the cryptocurrency.
- Adj Close: Adjusted closing price.
- Volume: Trading volume of the cryptocurrency.
BITCOIN DATA
btc_df = pd.read_csv('BTC-USD (2014-2024).csv')
btc_df.set_index('Date', inplace=True)
btc_df.head(10)
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2014-09-18 | 456.859985 | 456.859985 | 413.104004 | 424.440002 | 424.440002 | 34483200.0 |
| 2014-09-19 | 424.102997 | 427.834991 | 384.532013 | 394.795990 | 394.795990 | 37919700.0 |
| 2014-09-20 | 394.673004 | 423.295990 | 389.882996 | 408.903992 | 408.903992 | 36863600.0 |
| 2014-09-21 | 408.084991 | 412.425995 | 393.181000 | 398.821014 | 398.821014 | 26580100.0 |
| 2014-09-22 | 399.100006 | 406.915985 | 397.130005 | 402.152008 | 402.152008 | 24127600.0 |
| 2014-09-23 | 402.092010 | 441.557007 | 396.196991 | 435.790985 | 435.790985 | 45099500.0 |
| 2014-09-24 | 435.751007 | 436.112000 | 421.131989 | 423.204987 | 423.204987 | 30627700.0 |
| 2014-09-25 | 423.156006 | 423.519989 | 409.467987 | 411.574005 | 411.574005 | 26814400.0 |
| 2014-09-26 | 411.428986 | 414.937988 | 400.009003 | 404.424988 | 404.424988 | 21460800.0 |
| 2014-09-27 | 403.556000 | 406.622986 | 397.372009 | 399.519989 | 399.519989 | 15029300.0 |
btc_df.shape
(3413, 6)
btc_df.describe()
| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 3412.000000 | 3412.000000 | 3412.000000 | 3412.000000 | 3412.000000 | 3.412000e+03 |
| mean | 14747.360368 | 15091.809098 | 14376.126435 | 14758.111980 | 14758.111980 | 1.663026e+10 |
| std | 16293.633702 | 16683.948248 | 15855.901350 | 16295.374063 | 16295.374063 | 1.907607e+10 |
| min | 176.897003 | 211.731003 | 171.509995 | 178.102997 | 178.102997 | 5.914570e+06 |
| 25% | 921.790009 | 935.210266 | 908.876495 | 921.739258 | 921.739258 | 1.685530e+08 |
| 50% | 8288.819824 | 8464.720703 | 8108.011475 | 8285.438965 | 8285.438965 | 1.176004e+10 |
| 75% | 24345.831543 | 24986.300293 | 23907.724610 | 24382.675293 | 24382.675293 | 2.697648e+10 |
| max | 67549.734375 | 68789.625000 | 66382.062500 | 67566.828125 | 67566.828125 | 3.509679e+11 |
- Central tendency: The mean close price of Bitcoin over the specified period is approximately $14,758.11, indicating the average value around which the close prices cluster.
- Spread: The standard deviation of the close prices is relatively high at approximately $16,295.37, suggesting significant variability or dispersion of the close prices around the mean.
- Minimum and maximum: The minimum close price recorded for Bitcoin is approximately $178.10, while the maximum close price is around $67,566.83, illustrating the wide range of price levels observed during the specified period.
- Quartiles: The 25th percentile (Q1) close price is approximately $921.74, indicating that 25% of the close prices fall below this value. Similarly, the 75th percentile (Q3) close price is approximately $24,382.68, indicating that 75% of the close prices fall below this value.
- Median: The median close price (50th percentile, Q2) is approximately $8,285.44, representing the middle value of the close prices when arranged in ascending order.
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
# Plot 1: Time Series
sns.lineplot(x=btc_df.index, y=btc_df['Close'], data=btc_df, ax=axes[0])
axes[0].set_xticks(btc_df.index[::6*30])
axes[0].set_xticklabels(btc_df.index[::6*30])
axes[0].set_title("Close Price Time Series")
axes[0].tick_params(axis='x', rotation=45)
# Plot 2: Boxplot
btc_df["Close"].plot(kind="box", vert=False, title="Distribution of BTC Close Prices", ax=axes[1])
# Plot 3: Distribution Plot
sns.histplot(btc_df["Close"], kde=True, color='blue', bins=30, ax=axes[2])  # distplot is deprecated in recent seaborn
axes[2].set_title("Distribution of BTC Close Prices")
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
# Lag values
lags = [1, 7, 30, 90] # Reduce to fit in 2x2 grid
for i, ax in enumerate(axes.flat):
lag = lags[i]
ax.scatter(x=btc_df["Close"].shift(lag), y=btc_df["Close"])
ax.plot([btc_df["Close"].min(), btc_df["Close"].max()], [btc_df["Close"].min(), btc_df["Close"].max()], linestyle="--", color="red")
ax.set_xlabel(f"Close (lagged by {lag} days)")
ax.set_ylabel("Close")
ax.set_title(f"Autocorrelation of Close Prices ({lag}-day Lag)")
plt.tight_layout()
plt.show()
The observation regarding the strong autocorrelation at lag 1 suggests that there is a strong relationship between the current value of the time series (in this case, the close prices of Bitcoin) and its immediate past value. In other words, if the price of Bitcoin has been increasing or decreasing recently, it is likely to continue in the same direction in the short term. This phenomenon is commonly observed in financial time series data, where short-term trends or momentum tend to persist.
On the other hand, the weak autocorrelation at higher lags suggests that there is little persistence in any seasonality or cyclical patterns in the Bitcoin price data beyond the immediate past. This implies that any recurring patterns or cycles in Bitcoin prices tend to be short-lived or irregular, making them less predictable over longer time horizons.
Overall, the strong autocorrelation at lag 1 indicates the presence of short-term predictive power in the Bitcoin price data, while the weak autocorrelation at higher lags suggests that longer-term forecasting may be more challenging due to the lack of persistent seasonality or cyclical patterns.
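The visual impression from the lag plots can be quantified with pandas' built-in `Series.autocorr`. A minimal sketch on a synthetic random-walk series (standing in for `btc_df["Close"]`, which this notebook loads from a local CSV):

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices standing in for btc_df["Close"]
rng = np.random.default_rng(42)
close = pd.Series(20000 + np.cumsum(rng.normal(0, 300, size=2000)))

# Pearson autocorrelation at the same lags used in the scatter plots
for lag in [1, 7, 30, 90]:
    print(f"lag {lag:>2}: autocorrelation = {close.autocorr(lag=lag):.4f}")
```

Applied to the actual Bitcoin series, this should reproduce the pattern described above: a lag-1 value near 1 and progressively weaker values at longer lags.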
METHODOLOGY
- Data Cleaning / Preprocessing
Ensure data integrity by checking for and handling missing data.
Encode features: incorporate lagged values of cryptocurrency prices (since we are dealing with time-series data) to capture temporal dependencies.
- AutoML
Establish baseline performance using Linear Regression, RandomForest, GradientBoost, and DecisionTree models without additional enhancements.
Perform a grid search over ARIMA parameters.
- Improving the Model
Feature engineering: enhance the models by incorporating measures of volatility to better capture the dynamic nature of cryptocurrency markets.
null_counts = btc_df.isnull().sum()
print(null_counts)
Open         1
High         1
Low          1
Close        1
Adj Close    1
Volume       1
dtype: int64
null_indices = btc_df[btc_df.isnull().any(axis=1)].index
print(null_indices)
Index(['2024-01-20'], dtype='object', name='Date')
The DataFrame btc_df contains null values in all columns for the date '2024-01-20'.
btc_df_sorted = btc_df.sort_values(by='Date', ascending=False)
btc_df_sorted.head()
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2024-01-21 | 41671.488281 | 41693.160156 | 41615.140625 | 41623.695313 | 41623.695313 | 1.127404e+10 |
| 2024-01-20 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2024-01-19 | 41278.460938 | 42134.160156 | 40297.457031 | 41618.406250 | 41618.406250 | 2.575241e+10 |
| 2024-01-18 | 42742.312500 | 42876.347656 | 40631.171875 | 41262.058594 | 41262.058594 | 2.521836e+10 |
| 2024-01-17 | 43132.101563 | 43189.890625 | 42189.308594 | 42742.652344 | 42742.652344 | 2.085123e+10 |
btc_df.drop(index=['2024-01-21', '2024-01-20'], inplace=True)
We drop the last two rows of the DataFrame btc_df, corresponding to the dates '2024-01-21' and '2024-01-20'. This is acceptable since the removed dates are the final two entries in the dataset, so chronological order is maintained without compromising the integrity of the data.
Encode Features
The autocorrelation plots indicate a notable positive autocorrelation at lag 1 for Bitcoin closing prices. This signifies that the current closing price is strongly influenced by its immediate past value, suggesting short-term predictive power of the previous day's price on the current day's price.
To implement this insight, a lag 1 close feature can be introduced into the dataset. This involves creating a new column named "Close_Lag1," where each entry represents the closing price from the previous day. By incorporating this lag 1 feature, the model gains valuable information about the recent trend in prices, potentially enhancing its ability to capture temporal dependencies and improve predictive accuracy.
df = btc_df.copy()
df["Close.L1"] = df["Close"].shift(1)
df.dropna(inplace = True)
df.head()
| Date | Open | High | Low | Close | Adj Close | Volume | Close.L1 |
|---|---|---|---|---|---|---|---|
| 2014-09-19 | 424.102997 | 427.834991 | 384.532013 | 394.795990 | 394.795990 | 37919700.0 | 424.440002 |
| 2014-09-20 | 394.673004 | 423.295990 | 389.882996 | 408.903992 | 408.903992 | 36863600.0 | 394.795990 |
| 2014-09-21 | 408.084991 | 412.425995 | 393.181000 | 398.821014 | 398.821014 | 26580100.0 | 408.903992 |
| 2014-09-22 | 399.100006 | 406.915985 | 397.130005 | 402.152008 | 402.152008 | 24127600.0 | 398.821014 |
| 2014-09-23 | 402.092010 | 441.557007 | 396.196991 | 435.790985 | 435.790985 | 45099500.0 | 402.152008 |
Step 2: AutoML
Split train and test sets
The data has been split into feature and target variables, with "Close" designated as the target variable. The feature set, denoted as X, comprises all columns except the target variable. Subsequently, the dataset has been divided into training and testing sets, with an 80-20 split ratio. The training set (X_train, y_train) consists of the initial 80% of the data, while the testing set (X_test, y_test) comprises the remaining 20% of the data. This division facilitates model training on the training set and evaluation on the unseen testing set, allowing for the assessment of model performance on new data.
# Split the data into feature and target
target = "Close"
y = df[target]
X = df["Close.L1"]
#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
Calculate Baseline
The mean-score baseline model predicts the mean close price of the training set for every instance. The mean close price for the training set is approximately $11,508.07, giving a baseline Mean Absolute Error (MAE) of approximately $11,560.94. Using that same constant prediction for all instances in the testing set yields a baseline Root Mean Squared Error (RMSE) of approximately $18,104.27.
These baselines serve as a reference point for evaluating the performance of more complex models, providing insight into the effectiveness of predictive models beyond simply predicting the mean.
# Mean Score Baseline
y_pred_baseline = [y_train.mean()] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean Close Prices:", round(y_train.mean(), 2))
print("Baseline MAE:", round(mae_baseline, 2))
Mean Close Prices: 11508.07
Baseline MAE: 11560.94
# RMSE Baseline
baseline_prediction = np.mean(y_train)
baseline_predictions = np.full_like(y_test, baseline_prediction)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_predictions))
print("Baseline RMSE:", round(baseline_rmse, 2))
Baseline RMSE: 18104.27
Model Baselines
Next, we proceed to train and evaluate various regression models to assess their effectiveness in predicting cryptocurrency prices. By iteratively training and evaluating each model, we aim to understand each model's performance by checking metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and runtime.
The idea is to establish baseline performance using Linear Regression, RandomForest, GradientBoost, and DecisionTree without additional enhancements. This initial step provides a fundamental benchmark for evaluating the effectiveness of more complex predictive models. By comparing the performance of these basic models, we can assess their predictive capabilities and identify areas for improvement.
def train_and_evaluate(model, X_train, y_train, X_test, y_test):
start_time = time.time()
model.fit(X_train, y_train)
training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
training_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
runtime = time.time() - start_time
return training_mae, test_mae, training_rmse, test_rmse, runtime
def run_baseline(model, X_train, y_train, X_test, y_test, num_tests=10):
results = []
for _ in range(num_tests):
training_mae, test_mae, training_rmse, test_rmse, runtime = train_and_evaluate(model, X_train, y_train, X_test, y_test)
results.append({'Model Name': type(model).__name__, 'MAE': test_mae, 'RMSE': test_rmse, 'Runtime': runtime})
return results
# Initialize models
models = [
RandomForestRegressor(),
LinearRegression(),
GradientBoostingRegressor(),
DecisionTreeRegressor()
]
results_all_models = []
# Train and evaluate each model
for model in models:
results = run_baseline(model, X_train.values.reshape(-1, 1), y_train, X_test.values.reshape(-1, 1), y_test)
results_all_models.extend(results)
# Convert results to DataFrame
df_results = pd.DataFrame(results_all_models)
# Compute averages
avg_results = df_results.groupby('Model Name').mean()
avg_results
| Model Name | MAE | RMSE | Runtime |
|---|---|---|---|
| DecisionTreeRegressor | 1361.309005 | 1797.699143 | 0.010311 |
| GradientBoostingRegressor | 1030.262711 | 1315.233570 | 0.219155 |
| LinearRegression | 515.438167 | 792.148102 | 0.003849 |
| RandomForestRegressor | 1108.432802 | 1453.098120 | 0.626320 |
# # Function to perform ARIMA grid search
# def arima_grid_search(train_data, test_data, p_values, d_values, q_values):
#     results = []
#     for p, d, q in product(p_values, d_values, q_values):
#         try:
#             history = [x for x in train_data]
#             predictions = []
#             # walk-forward validation
#             for t in range(len(test_data)):
#                 model = ARIMA(history, order=(p, d, q))
#                 model_fit = model.fit()
#                 output = model_fit.forecast()
#                 yhat = output[0]
#                 predictions.append(yhat)
#                 obs = test_data[t]
#                 history.append(obs)
#             # evaluate forecasts
#             mae = mean_absolute_error(test_data, predictions)
#             rmse = sqrt(mean_squared_error(test_data, predictions))
#             results.append((p, d, q, mae, rmse))
#         except Exception:
#             continue
#     return pd.DataFrame(results, columns=['p', 'd', 'q', 'MAE', 'RMSE'])
# # Define ARIMA parameters for grid search
# p_values = range(0, 6)  # p values
# d_values = range(0, 2)  # d values
# q_values = range(0, 2)  # q values
# # Perform grid search (left commented out: the full walk-forward search is slow to run)
# grid_results = arima_grid_search(y_train.values, y_test.values, p_values, d_values, q_values)
# # Output results to DataFrame
# grid_results
Grid Search: ARIMA
Next, we perform a grid search to fine-tune the parameters of the ARIMA model. This process involves systematically evaluating different combinations of model parameters to identify the optimal configuration that yields the best performance. By leveraging grid search, we aim to enhance the predictive accuracy of the ARIMA model and improve its suitability for forecasting cryptocurrency prices.
| p | d | q | MAE | RMSE |
|---|---|---|---|---|
| 0 | 0 | 0 | 14632.688819 | 16611.304597 |
| 0 | 0 | 1 | 7445.398557 | 8543.820176 |
| 0 | 1 | 0 | 514.586990 | 791.797907 |
| 0 | 1 | 1 | 513.378288 | 791.093285 |
| 1 | 0 | 0 | 514.354121 | 791.915688 |
| 1 | 0 | 1 | 513.151961 | 791.228830 |
| 1 | 1 | 0 | 513.364870 | 791.057282 |
| 1 | 1 | 1 | 513.407777 | 791.106320 |
| 2 | 0 | 0 | 513.148681 | 791.200314 |
| 2 | 0 | 1 | 513.344987 | 791.352766 |
| 2 | 1 | 0 | 513.550905 | 791.103317 |
| 2 | 1 | 1 | 514.060564 | 791.342664 |
| 3 | 0 | 0 | 513.338597 | 791.221013 |
| 3 | 0 | 1 | 513.309885 | 791.278221 |
| 3 | 1 | 0 | 514.198221 | 790.639340 |
| 3 | 1 | 1 | 515.883490 | 790.754452 |
| 4 | 0 | 0 | 513.948599 | 790.733309 |
| 4 | 0 | 1 | 514.188689 | 791.002517 |
| 4 | 1 | 0 | 516.273217 | 791.916986 |
| 4 | 1 | 1 | 516.826047 | 792.005012 |
| 5 | 0 | 0 | 516.316086 | 792.012733 |
| 5 | 0 | 1 | 516.121262 | 791.976760 |
| 5 | 1 | 0 | 516.749558 | 792.156649 |
| 5 | 1 | 1 | 516.930870 | 792.185792 |
# grid_results_sorted = grid_results.sort_values(by=['MAE', 'RMSE'])
# grid_results_sorted.head(1)
| p | d | q | MAE | RMSE |
|---|---|---|---|---|
| 2 | 0 | 0 | 513.148681 | 791.200314 |
Based on the grid search we get the best parameters:
p (AR order): The value of 2 indicates that the model includes two lagged observations of the dependent variable (time series) as predictors. This means that the current value of the time series is modeled as a linear combination of its two most recent observations.
d (I order): The value of 0 implies that no differencing is required to make the time series stationary. In other words, the original time series is stationary or does not exhibit any trend or seasonality that needs to be removed through differencing.
q (MA order): The value of 0 suggests that the model does not include any lagged forecast errors (residuals) of the dependent variable as predictors. Therefore, the model does not explicitly capture any short-term dependencies beyond the lagged observations.
Given these parameter values, the ARIMA(2, 0, 0) model can be interpreted as follows:
The current value of the time series is linearly dependent on its two most recent observations (AR component). No differencing is required to make the time series stationary (I component), implying that the original time series already exhibits stationarity. The model does not incorporate any additional short-term dependencies beyond the lagged observations (MA component).
history = [x for x in y_train]
predictions = []
for t in range(len(y_test)):
    model = ARIMA(history, order=(2, 0, 0))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = y_test.iloc[t]
    history.append(obs)
df_pred_test = pd.DataFrame(
    {
        "y_test": y_test,
        "y_pred": predictions
    }, index=y_test.index
)
df_last_90 = df_pred_test.iloc[-90:]
# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "Close Price"}, title="ARIMA(p=2, d=0, q=0) Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()
Determining the Best Model
The ARIMA model with parameters (p=2, d=0, q=0) achieved an MAE (Mean Absolute Error) of approximately 513.15 and an RMSE (Root Mean Squared Error) of approximately 791.20. On the other hand, the Linear Regression model achieved a slightly higher MAE of approximately 515.44 and a comparable RMSE of approximately 792.15.
Comparing the performance of these two models, we observe that the ARIMA model slightly outperformed the Linear Regression model in terms of MAE, indicating that it made slightly more accurate predictions. However, the difference in performance between the two models is very minimal, suggesting that both models are relatively similar in their predictive capabilities for this particular dataset.
The average runtime for the Linear Regression model is approximately 0.004 seconds, whereas the ARIMA model's walk-forward evaluation took approximately 1 minute and 24.65 seconds to train and evaluate.
Given that no differencing is required to make the time series stationary, it implies that the original time series already exhibits stationarity. In such cases, Linear Regression may indeed be the preferred choice.
Linear Regression offers simplicity, interpretability, and computational efficiency, making it suitable for scenarios where the time series data does not exhibit complex temporal dependencies or nonlinear patterns. Additionally, the shorter runtime of Linear Regression compared to ARIMA further supports its practical applicability in scenarios where computational resources or runtime constraints are a concern.
df = btc_df.copy()
df["Close.L1"] = df["Close"].shift(1)
df.dropna(inplace = True)
# Split the data into feature and target
target = "Close"
y = df[target]
X = df[["Close.L1"]]
#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 515.4381667582289
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)
# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
| | Close.L1 |
|---|---|
| Coefficient | 0.99926 |
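A coefficient of roughly 0.999 on Close.L1 means the fitted model is nearly a naive persistence forecast, i.e. predicting that today's close equals yesterday's. A sketch of that comparison on a synthetic random-walk price series (standing in for the BTC data, which lives in a local CSV); the two MAEs should come out very close:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic random-walk prices standing in for the BTC close series
rng = np.random.default_rng(1)
close = pd.Series(20000 + np.cumsum(rng.normal(0, 300, size=1500)))
lag1 = close.shift(1)

# Drop the first row, which has no lagged value
y, x = close.iloc[1:], lag1.iloc[1:]
cut = int(len(y) * 0.8)

# One-feature linear regression on the lagged close, as in the cell above
lr = LinearRegression().fit(x.iloc[:cut].to_frame(), y.iloc[:cut])
lr_mae = mean_absolute_error(y.iloc[cut:], lr.predict(x.iloc[cut:].to_frame()))

# Naive persistence forecast: predict yesterday's close unchanged
naive_mae = mean_absolute_error(y.iloc[cut:], x.iloc[cut:])

print(f"LinearRegression MAE: {lr_mae:.2f}")
print(f"Persistence MAE:      {naive_mae:.2f}")
```

When a one-lag regression barely beats persistence, most of its apparent accuracy comes from prices changing slowly day to day, not from genuine predictive insight.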
df_pred_test = pd.DataFrame(
    {
        "y_test": y_test,
        "y_pred": predictions
    }
)
df_last_90 = df_pred_test.iloc[-90:]
# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "BTC Close Price"}, title="Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()
Step 3: Improving the Model
Feature Engineering
In an effort to enhance the predictive capabilities of our model, we've introduced additional features derived from the cryptocurrency price data. Specifically, we've calculated volatility measures at daily, weekly, monthly, and yearly intervals using rolling window standard deviations of the closing prices. These volatility measures capture the degree of price fluctuation within each respective time frame.
Additionally, we've incorporated lagged versions of the closing prices and volatility measures into the dataset. By shifting these features back by one and two time steps, we provide the model with historical information, enabling it to capture temporal dependencies and patterns in the data.
Furthermore, to ensure that our dataset remains consistent after introducing these new features, we've removed any rows containing missing values resulting from the lag operations.
df = btc_df.copy()
df['Volatility_Daily'] = df['Close'].rolling(window=2, min_periods=1).std()
df['Volatility_Weekly'] = df['Close'].rolling(window=7).std()
df['Volatility_Monthly'] = df['Close'].rolling(window=30).std()
df['Volatility_Yearly'] = df['Close'].rolling(window=365).std()
df["Close.L1"] = df["Close"].shift(1)
df["Close.L2"] = df["Close"].shift(2)
df['Volatility_Daily.L1'] = df['Volatility_Daily'].shift(1)
df['Volatility_Daily.L2'] = df['Volatility_Daily'].shift(2)
df['Volatility_Weekly.L1'] = df['Volatility_Weekly'].shift(1)
df['Volatility_Weekly.L2'] = df['Volatility_Weekly'].shift(2)
df['Volatility_Monthly.L1'] = df['Volatility_Monthly'].shift(1)
df['Volatility_Monthly.L2'] = df['Volatility_Monthly'].shift(2)
df['Volatility_Yearly.L1'] = df['Volatility_Yearly'].shift(1)
df['Volatility_Yearly.L2'] = df['Volatility_Yearly'].shift(2)
df.dropna(inplace = True)
df.head()
| Date | Open | High | Low | Close | Adj Close | Volume | Volatility_Daily | Volatility_Weekly | Volatility_Monthly | Volatility_Yearly | Close.L1 | Close.L2 | Volatility_Daily.L1 | Volatility_Daily.L2 | Volatility_Weekly.L1 | Volatility_Weekly.L2 | Volatility_Monthly.L1 | Volatility_Monthly.L2 | Volatility_Yearly.L1 | Volatility_Yearly.L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-09-19 | 232.858002 | 233.205002 | 231.089005 | 231.492996 | 231.492996 | 12712600.0 | 1.047939 | 1.250336 | 6.346485 | 57.053890 | 232.975006 | 229.809998 | 2.237999 | 0.508406 | 2.134827 | 3.998461 | 6.393035 | 6.437738 | 57.309662 | 57.745595 |
| 2015-09-20 | 231.399002 | 232.365005 | 230.910004 | 231.212006 | 231.212006 | 14444700.0 | 0.198690 | 1.261682 | 6.340692 | 56.710777 | 231.492996 | 232.975006 | 1.047939 | 2.237999 | 1.250336 | 2.134827 | 6.346485 | 6.393035 | 57.053890 | 57.309662 |
| 2015-09-21 | 231.216995 | 231.216995 | 226.520996 | 227.085007 | 227.085007 | 19678800.0 | 2.918229 | 1.890600 | 6.381719 | 56.432606 | 231.212006 | 231.492996 | 0.198690 | 1.047939 | 1.261682 | 1.250336 | 6.340692 | 6.346485 | 56.710777 | 57.053890 |
| 2015-09-22 | 226.968994 | 232.386002 | 225.117004 | 230.617996 | 230.617996 | 25009300.0 | 2.498200 | 1.894945 | 6.360249 | 56.120629 | 227.085007 | 231.212006 | 2.918229 | 0.198690 | 1.890600 | 1.261682 | 6.381719 | 6.340692 | 56.432606 | 56.710777 |
| 2015-09-23 | 230.936005 | 231.835007 | 229.591003 | 230.283005 | 230.283005 | 17254100.0 | 0.236874 | 1.817409 | 5.044624 | 55.570408 | 230.617996 | 227.085007 | 2.498200 | 2.918229 | 1.894945 | 1.890600 | 6.360249 | 6.381719 | 56.120629 | 56.432606 |
# Split the data into feature and target
target = "Close"
y = df[target]
X = df[["Close.L1", "Close.L2",
"Volatility_Daily.L1",
"Volatility_Weekly.L1",
"Volatility_Monthly.L1",
"Volatility_Yearly.L1"]]
#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 458.71113525341354
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)
# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
| | Close.L1 | Close.L2 | Volatility_Daily.L1 | Volatility_Weekly.L1 | Volatility_Monthly.L1 | Volatility_Yearly.L1 |
|---|---|---|---|---|---|---|
| Coefficient | 0.967635 | 0.027756 | 0.067707 | 0.008924 | 0.020558 | 0.000455 |
df_pred_test = pd.DataFrame(
{
"y_test": y_test,
"y_pred": predictions
}
)
df_pred_test.head(10)
| Date | y_test | y_pred |
|---|---|---|
| 2022-05-21 | 29432.226563 | 29280.383197 |
| 2022-05-22 | 30323.722656 | 29431.059319 |
| 2022-05-23 | 29098.910156 | 30328.908085 |
| 2022-05-24 | 29655.585938 | 29184.468628 |
| 2022-05-25 | 29562.361328 | 29655.061798 |
| 2022-05-26 | 29267.224609 | 29554.169498 |
| 2022-05-27 | 28627.574219 | 29274.018843 |
| 2022-05-28 | 28814.900391 | 28662.201573 |
| 2022-05-29 | 29445.957031 | 28800.726745 |
| 2022-05-30 | 31726.390625 | 29432.724339 |
df_last_90 = df_pred_test.iloc[-90:]
# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "BTC Close Price"}, title="Improved Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()
The Mean Absolute Error (MAE) has improved from approximately 515.44 to 458.71 after incorporating the additional features and lagged variables into the model. This reduction in MAE indicates that the model's predictive accuracy has improved, suggesting that the introduced enhancements have effectively captured additional information from the data.
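In relative terms, the drop from 515.44 to 458.71 is roughly an 11% reduction in test MAE:

```python
baseline_mae, improved_mae = 515.44, 458.71
reduction = (baseline_mae - improved_mae) / baseline_mae
print(f"Relative MAE reduction: {reduction:.1%}")  # → Relative MAE reduction: 11.0%
```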
We also build the same model for Ethereum (ETH) using the same features and parameters employed for the Bitcoin (BTC) data. This involves preprocessing the Ethereum dataset to incorporate additional features such as volatility measures and lagged variables.
eth_df = pd.read_csv('ETH-USD (2017-2024).csv')
eth_df.set_index('Date', inplace=True)
df = eth_df.copy()
df['Volatility_Daily'] = df['Close'].rolling(window=2, min_periods=1).std()
df['Volatility_Weekly'] = df['Close'].rolling(window=7).std()
df['Volatility_Monthly'] = df['Close'].rolling(window=30).std()
df['Volatility_Yearly'] = df['Close'].rolling(window=365).std()
df["Close.L1"] = df["Close"].shift(1)
df["Close.L2"] = df["Close"].shift(2)
df['Volatility_Daily.L1'] = df['Volatility_Daily'].shift(1)
df['Volatility_Daily.L2'] = df['Volatility_Daily'].shift(2)
df['Volatility_Weekly.L1'] = df['Volatility_Weekly'].shift(1)
df['Volatility_Weekly.L2'] = df['Volatility_Weekly'].shift(2)
df['Volatility_Monthly.L1'] = df['Volatility_Monthly'].shift(1)
df['Volatility_Monthly.L2'] = df['Volatility_Monthly'].shift(2)
df['Volatility_Yearly.L1'] = df['Volatility_Yearly'].shift(1)
df['Volatility_Yearly.L2'] = df['Volatility_Yearly'].shift(2)
df.dropna(inplace = True)
df.head()
| Date | Open | High | Low | Close | Adj Close | Volume | Volatility_Daily | Volatility_Weekly | Volatility_Monthly | Volatility_Yearly | Close.L1 | Close.L2 | Volatility_Daily.L1 | Volatility_Daily.L2 | Volatility_Weekly.L1 | Volatility_Weekly.L2 | Volatility_Monthly.L1 | Volatility_Monthly.L2 | Volatility_Yearly.L1 | Volatility_Yearly.L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-11-11 | 212.479004 | 212.998993 | 208.867996 | 211.339996 | 211.339996 | 1.501600e+09 | 0.843585 | 3.526736 | 5.753073 | 270.164345 | 212.533005 | 210.074005 | 1.738776 | 1.525228 | 4.083468 | 6.168682 | 5.839789 | 6.283667 | 269.871540 | 269.619064 |
| 2018-11-12 | 211.513000 | 212.623001 | 208.923996 | 210.417999 | 210.417999 | 1.452380e+09 | 0.651950 | 3.311558 | 5.732665 | 270.443820 | 211.339996 | 212.533005 | 0.843585 | 1.738776 | 3.526736 | 4.083468 | 5.753073 | 5.839789 | 270.164345 | 269.871540 |
| 2018-11-13 | 210.149002 | 210.514999 | 206.134995 | 206.826004 | 206.826004 | 1.610260e+09 | 2.539924 | 3.135081 | 5.420623 | 270.755271 | 210.417999 | 211.339996 | 0.651950 | 0.843585 | 3.311558 | 3.526736 | 5.732665 | 5.753073 | 270.443820 | 270.164345 |
| 2018-11-14 | 206.533997 | 207.044998 | 174.084000 | 181.397003 | 181.397003 | 2.595330e+09 | 17.981019 | 11.187730 | 6.989767 | 271.200449 | 206.826004 | 210.417999 | 2.539924 | 0.651950 | 3.135081 | 3.311558 | 5.420623 | 5.732665 | 270.755271 | 270.443820 |
| 2018-11-15 | 181.899002 | 184.251007 | 170.188995 | 180.806000 | 180.806000 | 2.638410e+09 | 0.417902 | 14.324449 | 8.201139 | 271.637530 | 181.397003 | 206.826004 | 17.981019 | 2.539924 | 11.187730 | 3.135081 | 6.989767 | 5.420623 | 271.200449 | 270.755271 |
# Split the data into feature and target
target = "Close"
y = df[target]
X = df[["Close.L1", "Close.L2",
        "Volatility_Daily.L1",
        "Volatility_Weekly.L1",
        "Volatility_Monthly.L1",
        "Volatility_Yearly.L1"]]
# Split the data chronologically into train (80%) and test (20%) sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
# Fit a linear regression and evaluate on the held-out period
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 32.55243778097503
The Mean Absolute Error (MAE) for Ethereum (ETH) is impressively low at 32.55, indicating that the model's predictions are, on average, close to the actual observed values. However, it is essential to weigh this figure against the scale and variability of the Ethereum price data.
eth_df.describe()
| Open | High | Low | Close | Adj Close | Volume | |
|---|---|---|---|---|---|---|
| count | 2263.000000 | 2263.000000 | 2263.000000 | 2263.000000 | 2263.000000 | 2.263000e+03 |
| mean | 1248.213140 | 1283.972388 | 1208.851543 | 1248.970441 | 1248.970441 | 1.205243e+10 |
| std | 1118.835543 | 1150.922648 | 1082.560829 | 1118.566081 | 1118.566081 | 1.012443e+10 |
| min | 84.279694 | 85.342743 | 82.829887 | 84.308296 | 84.308296 | 6.217330e+08 |
| 25% | 231.636727 | 236.766563 | 227.149369 | 231.901916 | 231.901916 | 4.845689e+09 |
| 50% | 1038.186646 | 1090.229980 | 956.325012 | 1039.099976 | 1039.099976 | 9.401190e+09 |
| 75% | 1870.983582 | 1905.373352 | 1844.880860 | 1871.952942 | 1871.952942 | 1.657259e+10 |
| max | 4810.071289 | 4891.704590 | 4718.039063 | 4812.087402 | 4812.087402 | 8.448291e+10 |
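One way to put an MAE on a comparable footing across assets is to express it as a fraction of the typical price level. The helper below is our own illustration, not part of the notebook; it uses the mean ETH close of roughly 1248.97 from `describe()` above.

```python
# Hypothetical helper (not from the notebook): express MAE as a
# percentage of the mean closing price, so errors are comparable
# across assets with very different price levels.
def relative_mae(mae, mean_close):
    """Return MAE as a percentage of the mean closing price."""
    return 100.0 * mae / mean_close

# Using the ETH values reported above: MAE 32.55, mean close ~1248.97
eth_pct = relative_mae(32.55, 1248.970441)
print(f"ETH relative MAE: {eth_pct:.2f}%")  # → ETH relative MAE: 2.61%
```

By this measure the ETH error is about 2.6% of the average price, which is a more informative statement than the raw MAE alone.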
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)
# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
| Close.L1 | Close.L2 | Volatility_Daily.L1 | Volatility_Weekly.L1 | Volatility_Monthly.L1 | Volatility_Yearly.L1 | |
|---|---|---|---|---|---|---|
| Coefficient | 0.930871 | 0.063868 | 0.003381 | 0.05042 | -0.015873 | 0.007528 |
df_pred_test = pd.DataFrame(
{
"y_test": y_test,
"y_pred": predictions
}
)
df_pred_test.tail(10)
| y_test | y_pred | |
|---|---|---|
| Date | ||
| 2024-01-10 | 2582.103516 | 2337.871501 |
| 2024-01-11 | 2619.619141 | 2563.255347 |
| 2024-01-12 | 2524.460205 | 2614.551210 |
| 2024-01-13 | 2576.597900 | 2528.455932 |
| 2024-01-14 | 2472.241211 | 2570.218343 |
| 2024-01-15 | 2511.363770 | 2474.528096 |
| 2024-01-16 | 2587.691162 | 2502.872089 |
| 2024-01-17 | 2528.369385 | 2574.448874 |
| 2024-01-18 | 2467.018799 | 2523.999140 |
| 2024-01-19 | 2489.498535 | 2462.939432 |
# Plot actual vs. predicted closes for the last 90 test observations
df_last_90 = df_pred_test.iloc[-90:]
# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "ETH Close Price"}, title="Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()
RESULTS AND DISCUSSION¶
In machine learning, the complexity of a model doesn't guarantee better performance; simple models like linear regression deserve serious consideration, as they can outperform more intricate ones. Our analysis supports this notion: Linear Regression emerged as the best-performing model, beating more complex alternatives such as ARIMA. Tree-based regression methods showed some promise, although their limited effectiveness for time-series forecasting was expected, since they cannot extrapolate beyond the range of prices seen in training.
Bitcoin and Ethereum are characterized by significant price fluctuations (volatility), and recent price movements carry most of the information about the next price. The influence of older observations decays quickly, suggesting that seasonal or cyclical patterns do not persist. The autocorrelation analysis supports this observation and highlights why cryptocurrency prices are hard to forecast accurately: the series are dynamic and their patterns are not persistent.
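The decay described above can be checked directly with pandas' `Series.autocorr`. A minimal sketch on a synthetic random-walk series (a stand-in, since the price CSVs are not reproduced here): price levels are dominated by the previous value, while day-to-day changes show almost no correlation.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "price" series as a stand-in for a Close column
rng = np.random.default_rng(42)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 1000)))

# Lag-1 autocorrelation of the price level is near 1 (yesterday dominates),
# while autocorrelation of daily *changes* is near 0 (no persistent pattern)
level_acf1 = prices.autocorr(lag=1)
returns_acf1 = prices.diff().dropna().autocorr(lag=1)
print(level_acf1, returns_acf1)
```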
While a model may display a low Mean Absolute Error (MAE), focusing solely on this metric does not capture its practical effectiveness. Even with a low MAE, the model fails to anticipate the sharp price swings that matter most when investing in crypto, and the plots make this visible: the predictions lag the actual trend by roughly one step. For instance, the MAE was 458.71 for Bitcoin and 32.55 for Ethereum; both are small relative to the respective price levels, yet the model still struggles to predict extreme price movements accurately.
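One complementary check is directional accuracy: the fraction of days on which the predicted and actual price moves share a sign. The helper below is our own illustration, not part of the notebook; it shows how a prediction that simply lags the series can track the price level closely (low MAE) while getting the direction of every turn wrong.

```python
import numpy as np

# Hypothetical metric (not from the notebook): fraction of days where
# the predicted and actual day-to-day price changes share a sign.
def directional_accuracy(y_true, y_pred):
    actual_move = np.sign(np.diff(np.asarray(y_true, dtype=float)))
    predicted_move = np.sign(np.diff(np.asarray(y_pred, dtype=float)))
    return float(np.mean(actual_move == predicted_move))

# Tiny illustration: predictions close to yesterday's price stay near the
# level but flip direction at every turning point of this alternating series
y_true = [100, 102, 101, 103, 102]
y_pred = [101, 100, 102, 101, 103]
print(directional_accuracy(y_true, y_pred))  # → 0.0
```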
However, these models may not suffice, for several reasons. First, cryptocurrency markets are highly volatile, with prices experiencing rapid and unpredictable fluctuations that ARIMA and linear regression struggle to capture and forecast. Second, cryptocurrency price series exhibit non-stationary behavior, meaning that their statistical properties change over time; traditional time series models such as ARIMA assume that the (differenced) series is stationary, and forecasts degrade when that assumption fails. Finally, cryptocurrency prices are influenced by a wide range of factors, including investor sentiment, market speculation, technological developments, regulatory changes, and macroeconomic events, complex interactions that these models do not adequately capture.
To address these limitations, we could explore finer-grained data (e.g., hourly or minute-level prices) and incorporate fundamental factors as additional features. By considering a broader range of variables and refining the granularity of the data, we could improve the accuracy and robustness of cryptocurrency price forecasts.
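As a sketch of the finer-granularity idea, pandas can aggregate hourly prices back into daily OHLC bars once such data are available. The hourly series below is synthetic, since the notebook only works with daily CSVs.

```python
import numpy as np
import pandas as pd

# Synthetic hourly closes over two days (stand-in for real hourly data)
idx = pd.date_range("2024-01-01", periods=48, freq="h")
hourly = pd.DataFrame({"Close": np.linspace(100, 147, 48)}, index=idx)

# Resample hourly closes into daily Open/High/Low/Close bars
daily = hourly["Close"].resample("D").agg(["first", "max", "min", "last"])
daily.columns = ["Open", "High", "Low", "Close"]
print(daily)
```

Working at this granularity multiplies the number of observations per day, which gives the lagged-feature approach above far more signal to learn from.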